[Errors] Error Collection

Reference: http://whatbeg.com/2018/12/05/tensorflowtips.html

Type Errors

  • TypeError: unsupported callable. Cause: the target object does not support the [] operation; check the target object's data type.
  • TypeError: <DatasetV1Adapter shapes: ((?, 1), {feature: (?, ?, ?), value: (?, ?, ?)}), types: (tf.int64, {feature: tf.int64, value: tf.int64})> is not a callable object. Cause: the target object does not support being indexed or called that way; check the target object's data type.
  • TypeError: Fetch argument None has invalid type <class 'NoneType'>
  • TypeError: exceptions must derive from BaseException
  • TypeError: List of Tensors when single Tensor expected. Cause: when building a tf.constant, the value must be plain Python/NumPy data (e.g. an np.array), not TensorFlow tensors; see the sketch below.
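    A minimal sketch of the tf.constant case above (a hypothetical example, not from the original post):

    import numpy as np
    import tensorflow as tf

    a = tf.ones([3])
    b = tf.zeros([3])
    # bad = tf.constant([a, b])   # raises "List of Tensors when single Tensor expected"
    # tf.constant expects plain Python / NumPy data ...
    ok = tf.constant(np.stack([np.ones(3), np.zeros(3)]), dtype=tf.float32)
    # ... to combine existing Tensors, use tf.stack / tf.concat instead.
    stacked = tf.stack([a, b])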

Format Errors

  • TabError: inconsistent use of tabs and spaces in indentation
    Cause: inconsistent indentation (tabs mixed with spaces).

  • ValueError: setting an array element with a sequence.
    Cause: the data items have inconsistent lengths; see the NumPy sketch below.
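    A minimal NumPy illustration of the ragged-length case (the values are made up for illustration):

    import numpy as np

    rows = [[1, 2, 3], [4, 5]]               # rows of different lengths
    # np.array(rows, dtype=np.float32)       # raises "setting an array element with a sequence."
    padded = [r + [0] * (3 - len(r)) for r in rows]   # pad to a common length first
    arr = np.array(padded, dtype=np.float32)          # shape (2, 3)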

  • ValueError: Passed Tensor(…) should have graph attribute that is equal to current graph

  • ValueError: 2 is not a valid task_id for task_type worker in the cluster_spec:

    Cause: the task_index (i.e. the task_id) exceeds the number of worker_hosts.

    When running on the cluster, the TF_CONFIG environment built from the clusterSpec must define a chief. The TF framework on this cluster automatically passes in worker_hosts but no chief worker, so worker 0 is taken as the chief; the remaining workers are then effectively shifted down by one at run time, and worker 0 has to be configured separately from the other cases.

    Solution:

    import json
    import os

    # Option 1: let the worker that acts as chief also take part in the computation
    if self.params.job_name == 'worker' and self.params.task_index == 0:
        cluster = {'chief': [worker_hosts[0]],
                   'ps': ps_hosts,
                   'worker': worker_hosts[1:]}
        os.environ['TF_CONFIG'] = json.dumps(
            {'cluster': cluster,
             'task': {'type': 'chief', 'index': 0}})
    else:
        cluster = {'chief': [worker_hosts[0]],
                   'ps': ps_hosts,
                   'worker': worker_hosts}
        os.environ['TF_CONFIG'] = json.dumps(
            {'cluster': cluster,
             'task': {'type': self.params.job_name, 'index': self.params.task_index}})

    # Option 2: shift the task_index of the remaining parallel workers down by one
    cluster = {"chief": [worker_hosts[0]], "ps": ps_hosts, "worker": worker_hosts[1:]}
    if self.params.job_name == 'worker' and self.params.task_index == 0:
        os.environ["TF_CONFIG"] = json.dumps({
            "cluster": cluster,
            "task": {"type": 'chief', "index": 0}})
    elif self.params.job_name == 'worker':
        os.environ["TF_CONFIG"] = json.dumps({
            "cluster": cluster,
            "task": {"type": self.params.job_name, "index": self.params.task_index - 1}})
    else:
        os.environ["TF_CONFIG"] = json.dumps({
            "cluster": cluster,
            "task": {"type": self.params.job_name, "index": self.params.task_index}})

Mismatch Errors

  • InvalidArgumentError (see above for traceback):tensorflow.python.framework.errors_impl.InvalidArgumentError
    Cause: this error can have many different causes; check the full error log for the specifics.

  • Assign requires shapes of both tensors to match.
    Cause: the row/column counts do not match, e.g. in a matrix multiplication.

  • Assign requires shapes of both tensors to match. lhs shape= [557092,300] rhs shape= [500764,300]

  • InvalidArgumentError (see above for traceback): Incompatible shapes: [200,4096] vs. [200,16777216]

    Cause: while computing the AUC, the number of samples blew up from 4096 to 4096**2.

    Solution: add keep_dims=True to tf.reduce_sum(fm_bi, 1); see the sketch below.
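    A plausible reconstruction of the keep_dims fix (fm_bi and the second tensor's shape are illustrative, not the original model code):

    import tensorflow as tf

    fm_bi = tf.ones([200, 4096])
    labels = tf.ones([200, 1])

    logits_bad = tf.reduce_sum(fm_bi, 1)                  # shape (200,)
    # A (200,) tensor broadcast against a (200, 1) tensor yields (200, 200),
    # which is how the sample count can square itself downstream.
    blown_up = logits_bad * labels                        # shape (200, 200)

    logits_ok = tf.reduce_sum(fm_bi, 1, keep_dims=True)   # shape (200, 1)
    kept = logits_ok * labels                             # shape (200, 1)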

Model Reload Errors

  • ResourceExhaustedError (see above for traceback): Cannot allocate memory & Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. Cause: framework configuration issue; see the sketch below.
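    A sketch of following that hint in a TF1 session (the variable and op here are placeholders for the real training step):

    import tensorflow as tf

    x = tf.Variable(tf.zeros([4, 4]))
    train_op = x.assign_add(tf.ones([4, 4]))

    # Report the live tensor allocations if this step hits OOM.
    run_options = tf.RunOptions(report_tensor_allocations_upon_oom=True)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(train_op, options=run_options)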

Other Errors

  • ValueError: Couldn't find trained model at /tmp/tmpsov4rvpc. Cause:
  • tensorflow.python.framework.errors_impl.UnknownError:Input/output error

    Cause:

    Solution:

  • FailedPreconditionError (see above for traceback): Attempting to use uninitialized value precision_at_thresholds/false_positives

  • terminate called after throwing an instance of 'std::bad_alloc'

    Cause: too much data is read or computed in a single step, so the process runs out of memory.

    Solution: reduce the amount of data read per step (e.g. the batch_size) and/or reduce the number of network layers and nodes, so the computation needs fewer resources; see the sketch below.
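    A sketch of the batch-size part of that fix, assuming a tf.data input pipeline (the file name and sizes are illustrative):

    import tensorflow as tf

    dataset = tf.data.TFRecordDataset(['train.tfrecord'])
    # dataset = dataset.batch(4096)   # a large batch reads and computes more per step
    dataset = dataset.batch(256)      # a smaller batch keeps peak memory lower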

  • UnavailableError (see above for traceback): Socket closed

    Cause: the parameters held on the PS are too large and it runs out of memory.

    Solution: increase num_ps (the number of parameter servers).

  • DataLossError (see above for traceback): corrupted record at 0

    Cause: there is a problem with the input data.

    Solution: check the data format and how the data was written; see the sketch below.
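    One way to check the input, assuming TFRecord files (TF1-style API; the path is illustrative): iterate the file record by record, and the iterator raises DataLossError at the first corrupted record.

    import tensorflow as tf

    path = 'part-00000.tfrecord'   # illustrative path
    count = 0
    try:
        for _ in tf.python_io.tf_record_iterator(path):
            count += 1
        print('all %d records are readable' % count)
    except tf.errors.DataLossError as e:
        print('corrupted record after %d good records: %s' % (count, e))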

  • INFO An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed

    Cause: the parameters are too large and exceed the PS memory or bandwidth limit.

    Solution: reducing the number of nodes in the network layers fixed it; increasing the number of parameter servers did not help here, which looks like a bug.

  • tensorflow.python.framework.errors_impl.UnavailableError: OS Error

    Cause: two causes have been reported online: either the PS runs out of memory, or the GRPC poll strategy environment variable is set to epoll.

    Solution: increase the available memory, or set the environment variable to poll, i.e. os.environ['GRPC_POLL_STRATEGY'] = 'poll'; see the sketch below.
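    A sketch of that workaround; as an extra assumption here, the variable is set before TensorFlow (and therefore gRPC) is imported so that it takes effect:

    import os
    os.environ['GRPC_POLL_STRATEGY'] = 'poll'   # the key is the variable name itself, not 'export ...'

    import tensorflow as tf   # imported only after the environment variable is set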

  • tensorflow.python.framework.errors_impl.UnknownError: ; Input/output error

    Cause: unknown (searching turned up nothing).

    Guess: when jobs run concurrently, the model is read while it is still being saved, i.e. the file being read has not been fully written yet?

  • AttributeError: '_GeneratorContextManager' object has no attribute 'run'

    Cause: the difference between sess = sv.managed_session() and with sv.managed_session() as sess.

    Solution: use the with form; see the sketch below.
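    A minimal sketch of the two usages (sv is a tf.train.Supervisor, as in the item above):

    import tensorflow as tf

    x = tf.Variable(0)
    inc = x.assign_add(1)
    sv = tf.train.Supervisor(logdir=None)

    # Wrong: managed_session() is a context manager, so this assignment yields a
    # _GeneratorContextManager object, which has no .run() method.
    # sess = sv.managed_session()
    # sess.run(inc)                 # AttributeError

    # Right: enter the context manager with `with` to get the actual Session.
    with sv.managed_session() as sess:
        print(sess.run(inc))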
